Abstract

Lasso is a widely used regression technique for finding sparse representations. When the dimension
of the feature space and the number of samples are extremely large, solving the Lasso problem remains
challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues
have proposed the SAFE rules, which are able to quickly identify the inactive predictors, i.e., predictors
that have 0 components in the solution vector. The inactive predictors or features can then be removed
from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form,
it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual
solution. In this paper, we propose a fast and efficient screening rule via Dual Polytope Projections
(DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution
due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we
show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of
our knowledge, there are currently no "exact" screening rules for group Lasso. We have evaluated our
screening rule using both synthetic and real data sets. Results show that our rule is more effective at
identifying inactive predictors than existing state-of-the-art screening rules.



1 Introduction 



Data with various structures and scales comes from almost every aspect of daily life. Effectively extracting
patterns from the data and building interpretable models with high prediction accuracy is always desirable. One
popular technique for identifying important explanatory features is sparse regularization. For instance,
consider the widely used $\ell_1$-regularized least squares regression problem known as Lasso [Tibshirani 1996].
The most appealing property of Lasso is the sparsity of the solutions, which is equivalent to feature selection.
Suppose we have $N$ observations and $p$ predictors. Let $y$ denote the $N$-dimensional response vector and
$X = [x_1, x_2, \ldots, x_p]$ be the $N \times p$ feature matrix. The Lasso problem is formulated as the following
optimization problem:

$$\inf_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \tag{1}$$

where $\lambda > 0$ is a regularization parameter.



Lasso has achieved great success in a wide range of applications [Chen et al. 2001, Candes 2006, Zhao
and Yu 2006, Bruckstein et al. 2009, Wright et al. 2010], and in recent years many algorithms have been
developed to efficiently solve the Lasso problem [Efron et al. 2004, Kim et al. 2007, Park and Hastie 2007,
Donoho and Tsaig 2008, Friedman et al. 2007, Becker et al. 2010, Friedman et al. 2010]. However, when
the dimension of the feature space and the number of samples are very large, solving the Lasso problem remains
challenging because we may not even be able to load the data matrix into main memory. The idea of a
screening test proposed by El Ghaoui et al. [2010a] is to first identify inactive predictors
that have 0 components in the solution and then remove them from the optimization. Therefore, we can
work on a reduced feature matrix to solve Lasso efficiently.

In El Ghaoui et al. [2010a], the "SAFE" rule discards $x_i$ when

$$|x_i^T y| < \lambda - \|x_i\|_2 \|y\|_2 \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}}, \tag{2}$$

where $\lambda_{\max} = \max_i |x_i^T y|$ is the largest parameter such that the solution is nontrivial. Tibshirani et al.
[2012] proposed a set of strong rules which were more effective in identifying inactive
predictors. The basic version discards $x_i$ if

$$|x_i^T y| < 2\lambda - \lambda_{\max}. \tag{3}$$

However, it should be noted that the proposed strong rules might mistakenly discard active predictors, i.e.,
predictors which have nonzero coefficients in the solution vector. Xiang et al. [Xiang et al. 2011, Xiang and
Ramadge 2012] developed a set of screening tests based on the estimation of the optimal dual solution and
they have shown that the SAFE rules are in fact a special case of the general sphere test.
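The two tests in Eq. (2) and Eq. (3) are simple enough to state in a few lines of code. The following is a minimal NumPy sketch, not the implementation used by the cited authors; the function names `safe_rule` and `strong_rule` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def safe_rule(X, y, lam):
    """Boolean mask of predictors discarded by the basic SAFE test, Eq. (2)."""
    corr = np.abs(X.T @ y)                      # |x_i^T y| for every column
    lam_max = corr.max()                        # lambda_max = max_i |x_i^T y|
    col_norms = np.linalg.norm(X, axis=0)       # ||x_i||_2
    bound = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < bound

def strong_rule(X, y, lam):
    """Mask for the basic strong rule, Eq. (3); heuristic, may discard active predictors."""
    corr = np.abs(X.T @ y)
    return corr < 2.0 * lam - corr.max()

# Toy usage on made-up data (identity design, two predictors).
X = np.eye(2)
y = np.array([1.0, 0.1])
safe_mask = safe_rule(X, y, lam=1.0)       # lam equals lambda_max here
strong_mask = strong_rule(X, y, lam=1.0)
```

On this toy input both tests discard only the second predictor at $\lambda = \lambda_{\max}$, and neither discards anything at $\lambda = 0.5\,\lambda_{\max}$, which illustrates how quickly both bounds loosen as $\lambda$ moves away from $\lambda_{\max}$.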

In this paper, we develop new efficient and effective screening rules for the Lasso problem; our screening
rules are exact in the sense that no active predictors will be discarded. By transforming problem (1) to
its dual form, our motivation is mainly based on three geometric observations in the dual space. First, the
active predictors belong to a subset of the active constraints on the optimal dual solution, which is a direct
consequence of the KKT conditions. Second, the optimal dual solution is in fact the projection of the scaled
response vector onto the feasible set of the dual variables. Third, because the feasible set of the dual variables
is closed and convex, the projection is nonexpansive with respect to $\lambda$ [Bertsekas 2003], which results in an
effective estimation of its variation.

The rest of this paper is organized as follows. We present the DPP screening rules for the Lasso problem
in Section 2. Section 3 extends the idea of DPP screening rules to identify inactive groups in group Lasso
[Yuan and Lin 2006]. We have evaluated the proposed screening rules using both synthetic and real data.
In Section 4, extensive experimental results demonstrate that the proposed rules are more effective than
existing state-of-the-art screening rules.



2 Screening Rules for Lasso via Dual Polytope Projections 

In this section, we first discuss the geometric properties of the dual formulation of problem (1) (Section
2.1). Specifically, the optimal dual solution can be formulated as the projection of the scaled response vector
onto the feasible set, which is a closed and convex polytope in the dual space. According to the properties
of projection operators with respect to closed convex sets [Bertsekas 2003], the dual optimal solution is unique and
nonexpansive. Based on the geometric properties of the dual optimal solution, we develop the fundamental principle,
i.e., Theorem 1, which can be used to construct screening rules for Lasso. For illustrative purposes only,
we provide Corollary 2 as a concrete example of the fundamental principle. We also reveal the connections
between DPP rules and the sphere test [Xiang et al. 2011]. In Section 2.2, we discuss the relation between
the dual optimal solution and LARS [Efron et al. 2004]. As a straightforward extension of DPP rules, we develop the
sequential version of DPP (SDPP) in Section 2.3.



2.1 Fundamental Screening Rules via Dual Polytope Projections 



Different from Xiang et al. [2011], Xiang and Ramadge [2012], we do not assume $y$ and all $x_i$ have unit
length. We first transform problem (1) to its dual form (to make the paper self-contained, we provide the
detailed derivation of the dual form in the supplemental materials):

$$\sup_{\theta} \; \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \quad \text{subject to } |x_i^T\theta| \le 1, \; i = 1, 2, \ldots, p, \tag{4}$$

where $\theta$ is the dual variable.

Since the feasible set, denoted by $F$, is the intersection of $2p$ half-spaces, it is a closed and convex
polytope. From the objective function of the dual problem (4), it is easy to see that the optimal dual
solution $\theta^*$ is a feasible $\theta$ which is closest to $y/\lambda$. In other words, $\theta^*$ is the projection of $y/\lambda$ onto the polytope
$F$. Mathematically, for an arbitrary vector $w$ and a convex set $C$, if we define the projection function as

$$P_C(w) = \operatorname*{argmin}_{u \in C} \|u - w\|_2,$$

then

$$\theta^* = P_F(y/\lambda) = \operatorname*{argmin}_{\theta \in F} \left\|\theta - \frac{y}{\lambda}\right\|_2. \tag{5}$$

We know the optimal primal and dual solutions satisfy:

$$y = X\beta^* + \lambda\theta^* \tag{6}$$

and the KKT conditions for the Lasso problem are

$$(\theta^*)^T x_i \in \begin{cases} \{\operatorname{sign}([\beta^*]_i)\} & \text{if } [\beta^*]_i \neq 0, \\ [-1, 1] & \text{if } [\beta^*]_i = 0, \end{cases} \tag{7}$$

where $[\cdot]_k$ denotes the $k$th component.

By the KKT conditions in Eq. (7), if the inner product $(\theta^*)^T x_i$ belongs to the open interval $(-1, 1)$, then
the corresponding component $[\beta^*]_i$ in the solution vector $\beta^*(\lambda)$ has to be 0. As a result, $x_i$ is an inactive
predictor and can be removed from the optimization.
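The relations in Eqs. (5)-(7) can be checked numerically on a case where everything is available in closed form. The sketch below assumes an identity design matrix, for which the Lasso solution is the soft-thresholding of $y$ and the dual polytope $F$ is the box $|\theta_i| \le 1$; this choice, the helper names, and the data are assumptions for illustration, not part of the method itself.

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form Lasso solution when the design has orthonormal columns."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def dual_from_primal(X, y, beta, lam):
    """theta* recovered from Eq. (6): y = X beta* + lam * theta*."""
    return (y - X @ beta) / lam

def inactive_by_kkt(X, theta, tol=1e-9):
    """Predictors with |x_i^T theta*| < 1, which Eq. (7) forces to have zero coefficients."""
    return np.abs(X.T @ theta) < 1.0 - tol

X = np.eye(3)                       # assumption: identity design
y = np.array([3.0, -0.5, 1.2])
lam = 1.0

beta = soft_threshold(X.T @ y, lam)             # approximately [2.0, 0.0, 0.2]
theta = dual_from_primal(X, y, beta, lam)       # approximately [1.0, -0.5, 1.0]
projection = np.clip(y / lam, -1.0, 1.0)        # P_F(y / lam) when F is the unit box, Eq. (5)

# Objective values of the primal problem (1) and the dual problem (4).
primal = 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))
dual = 0.5 * (y @ y) - 0.5 * lam**2 * np.sum((theta - y / lam) ** 2)
inactive = inactive_by_kkt(X, theta)            # only the second predictor is flagged
```

In this example `theta` coincides with `projection`, the primal and dual objective values agree, and the only predictor with $|x_i^T\theta^*| < 1$ is the one whose coefficient is zero.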

On the other hand, let

$$\partial H(x_i) = \{z : z^T x_i = 1\} \quad \text{and} \quad H(x_i)_- = \{z : z^T x_i \le 1\}$$

denote the hyperplane and half space determined by $x_i$ respectively. Consider the dual problem (4); the
constraints induced by each $x_i$ are equivalent to requiring each feasible $\theta$ to lie inside the intersection of $H(x_i)_-$
and $H(-x_i)_-$. If $|(\theta^*)^T x_i| = 1$, i.e., either $\theta^* \in \partial H(x_i)$ or $\theta^* \in \partial H(-x_i)$, we say the constraints induced
by $x_i$ are active on $\theta^*$.

We define the "active" set on $\theta^*$ as

$$\mathcal{I}_{\theta^*} = \{i : |(\theta^*)^T x_i| = 1, \; i \in \mathcal{I}\},$$

where $\mathcal{I} = \{1, 2, \ldots, p\}$. Otherwise, if $\theta^*$ lies between $\partial H(x_i)$ and $\partial H(-x_i)$, i.e., $|(\theta^*)^T x_i| < 1$, we can safely
remove $x_i$ from the problem because $[\beta^*]_i = 0$ according to the KKT conditions in Eq. (7). Similarly, the
"inactive" set on $\theta^*$ is defined as $\bar{\mathcal{I}}_{\theta^*} = \mathcal{I} \setminus \mathcal{I}_{\theta^*}$.

Therefore, from a geometric perspective, if we know $\theta^*$, i.e., the projection of $y/\lambda$ onto $F$, the predictors
in the inactive set on $\theta^*$ can be discarded from the optimization. It is worthwhile to mention that inactive
predictors, i.e., predictors that have 0 components in the solution, are not the same as predictors in the
inactive set. In fact, by the KKT conditions, predictors in the inactive set must be inactive predictors since
they are guaranteed to have 0 components in the solution, but the converse may not be true.

Motivated by the above geometric intuitions, we next show how to find the predictors in the inactive set
on $\theta^*$. To emphasize the dependence on $\lambda$, let us write $\theta^*(\lambda)$ and $\beta^*(\lambda)$. If we know exactly where $\theta^*(\lambda)$
is, it will be trivial to find the predictors in the inactive set. Unfortunately, in most cases, we only
have incomplete information about $\theta^*(\lambda)$ without actually solving problem (1) or (4). Suppose we know the
exact $\theta^*(\lambda')$ for a specific $\lambda'$. How can we estimate $\theta^*(\lambda'')$ for another $\lambda''$ and its inactive set? To answer
this question, we start from Eq. (5); $\theta^*(\lambda)$ is nonexpansive because it is a projection operator. We obtain
the following result.
Theorem 1. For the Lasso problem, assume we are given the solution of its dual problem $\theta^*(\lambda')$ for a specific
$\lambda'$. Let $\lambda''$ be a positive value different from $\lambda'$. If the following holds:

$$|x_i^T \theta^*(\lambda')| < 1 - \|x_i\|_2 \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right|,$$

then $[\beta^*(\lambda'')]_i = 0$.

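Proof. From the KKT conditions in Eq. (7), we know

$$|x_i^T \theta^*(\lambda'')| < 1 \;\Rightarrow\; [\beta^*(\lambda'')]_i = 0.$$

By the dual problem (4), $\theta^*(\lambda)$ is the projection of $y/\lambda$ onto the feasible set $F$. According to the projection
theorem [Bertsekas 2003] for closed convex sets, $\theta^*(\lambda)$ is continuous and nonexpansive, i.e.,

$$\|\theta^*(\lambda'') - \theta^*(\lambda')\|_2 \le \left\|\frac{y}{\lambda''} - \frac{y}{\lambda'}\right\|_2 = \|y\|_2 \left|\frac{1}{\lambda''} - \frac{1}{\lambda'}\right|. \tag{8}$$

Then

$$\begin{aligned}
|x_i^T \theta^*(\lambda'')| &\le |x_i^T \theta^*(\lambda'') - x_i^T \theta^*(\lambda')| + |x_i^T \theta^*(\lambda')| \\
&< \|x_i\|_2 \|\theta^*(\lambda'') - \theta^*(\lambda')\|_2 + 1 - \|x_i\|_2 \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right| \\
&\le \|x_i\|_2 \|y\|_2 \left|\frac{1}{\lambda''} - \frac{1}{\lambda'}\right| + 1 - \|x_i\|_2 \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right| = 1,
\end{aligned} \tag{9}$$

which completes the proof. □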



From Theorem 1, it is easy to see that our rule is quite flexible, since every $\theta^*(\lambda')$ results in a new
screening rule. The smaller the gap between $\lambda'$ and $\lambda''$, the more effective the screening rule is. By
"more effective", we mean a stronger capability of the screening rule in identifying inactive predictors.
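The test in Theorem 1 translates directly into code once a reference dual solution is available. The following is a minimal sketch under stated assumptions: the function name `dpp_rule` is hypothetical, and the usage relies on an identity design so that $\theta^*(\lambda')$ is simply $y/\lambda'$ clipped to $[-1, 1]$.

```python
import numpy as np

def dpp_rule(X, y, theta_ref, lam_ref, lam_new):
    """Theorem 1: mask of predictors guaranteed to be inactive at lam_new.

    theta_ref must be the exact dual optimum theta*(lam_ref); the guarantee
    does not carry over to an approximate dual point.
    """
    radius = np.linalg.norm(y) * abs(1.0 / lam_ref - 1.0 / lam_new)
    col_norms = np.linalg.norm(X, axis=0)
    return np.abs(X.T @ theta_ref) < 1.0 - col_norms * radius

X = np.eye(3)                                   # assumption: identity design
y = np.array([3.0, -0.5, 1.2])
lam_ref = 1.5
theta_ref = np.clip(y / lam_ref, -1.0, 1.0)     # exact theta*(1.5) for this design

far = dpp_rule(X, y, theta_ref, lam_ref, lam_new=1.30)    # larger gap to lam_ref
near = dpp_rule(X, y, theta_ref, lam_ref, lam_new=1.45)   # smaller gap to lam_ref
```

For this toy problem the second and third coefficients are zero at both target values. The rule with the larger gap certifies only the second predictor, while the rule with the smaller gap certifies both, which matches the remark that a smaller gap between $\lambda'$ and $\lambda''$ gives a more effective rule.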

As an example, let us find $\theta^*(\lambda_{\max})$. Recall that $\lambda_{\max} = \max_i |x_i^T y|$. It is easy to verify that $y/\lambda_{\max}$ is itself
feasible. Therefore the projection of $y/\lambda_{\max}$ onto $F$ is itself, i.e., $\theta^*(\lambda_{\max}) = y/\lambda_{\max}$. Moreover, by noting that
for all $\lambda > \lambda_{\max}$ we have $|x_i^T y/\lambda| < 1$, $i \in \mathcal{I}$, i.e., all predictors are in the inactive set at $\theta^*(\lambda)$, we conclude that
the solution to problem (1) is 0. Combining all these together and plugging $\theta^*(\lambda_{\max}) = y/\lambda_{\max}$ into the condition
of Theorem 1, we obtain the following screening rule.

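Corollary 2. DPP: For the Lasso problem (1), let $\lambda_{\max} = \max_i |x_i^T y|$.

1. If $\lambda \ge \lambda_{\max}$, then $[\beta^*]_i = 0$, $\forall i \in \mathcal{I}$;

2. Otherwise, if the following holds:

$$\left|x_i^T \frac{y}{\lambda_{\max}}\right| < 1 - \|x_i\|_2 \|y\|_2 \left(\frac{1}{\lambda} - \frac{1}{\lambda_{\max}}\right),$$

then $[\beta^*(\lambda)]_i = 0$.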
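A small worked instance may help to read Corollary 2. The numbers below are an illustrative assumption (identity design with $N = p = 2$), chosen so that the Lasso solution is the soft-thresholding of $y$ and can be checked by hand.

```latex
% Assumed toy data: X = I_2, y = (4, 3)^T, so that
% \lambda_{\max} = \max(4, 3) = 4, \|y\|_2 = 5, \|x_2\|_2 = 1.
\[
\left| x_2^T \frac{y}{\lambda_{\max}} \right| = \frac{3}{4} = 0.75 .
\]
% Case \lambda = 3.5:
\[
1 - \|x_2\|_2 \|y\|_2 \left( \frac{1}{3.5} - \frac{1}{4} \right)
  = 1 - \frac{5}{28} \approx 0.821 > 0.75
  \quad\Rightarrow\quad [\beta^*(3.5)]_2 = 0 .
\]
% Case \lambda = 3.2:
\[
1 - \|x_2\|_2 \|y\|_2 \left( \frac{1}{3.2} - \frac{1}{4} \right)
  = 1 - \frac{5}{16} = 0.6875 < 0.75
  \quad\Rightarrow\quad \text{the test is inconclusive.}
\]
```

In the second case the predictor is in fact still inactive, since soft-thresholding gives $[\beta^*(3.2)]_2 = \max(3 - 3.2, 0) = 0$. The rule is safe but becomes conservative as $\lambda$ moves away from $\lambda_{\max}$.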
Table 1: Illustration of the running time for DPP screening and for solving the Lasso problem after screening.
$T_s$: time for screening. $T_l$: time for solving the Lasso problem after screening. $T$: the total time. Entries
of the response vector $y$ are i.i.d. standard Gaussian. Columns of the data matrix $X \in \mathbb{R}^{1000 \times 100000}$
are generated by $x_i = y + \alpha z$, where $\alpha$ is a random number drawn from the uniform distribution on $[0, 1]$.
Entries of $z$ are i.i.d. standard Gaussian. $\lambda_{\max} = 0.95$ and $\lambda/\lambda_{\max} = 0.5$.

             Lasso      DPP       DPP2     DPP5     DPP10    DPP20
T_s (s)      -          0.035     0.073    0.152    0.321    0.648
T_l (s)      -          10.250    9.634    8.399    1.369    0.121
T   (s)      103.314    10.285    9.707    8.552    1.690    0.769



Clearly DPP is most effective when $\lambda$ is close to $\lambda_{\max}$. So how can we find a new $\theta^*(\lambda')$ with $\lambda' < \lambda_{\max}$?
Note that Eq. (6) is in fact a natural bridge which relates the primal and dual optimal solutions. As long
as we know $\beta^*(\lambda')$, it is easy to get $\theta^*(\lambda')$ when $\lambda$ is relatively small, e.g., by the LARS [Efron et al. 2004] and
Homotopy [Osborne et al. 2000] algorithms.

Remark: Xiang et al. [2011] developed a general sphere test which says that if $\theta^*$ is estimated
to be inside a ball $\|\theta^* - q\|_2 \le r$, then

$$|x_i^T q| < (1 - r) \;\Rightarrow\; [\beta^*]_i = 0.$$

Considering the DPP rules in Theorem 1, it is equivalent to setting $q = \theta^*(\lambda')$ and $r = \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right|$
(under the unit-length normalization of $y$ and $x_i$ assumed in the sphere test). Therefore,
different from the sphere test and Dome developed in [Xiang et al. 2011, Xiang and Ramadge 2012] with
the radius $r$ fixed at the beginning, the construction of our DPP rules is equivalent to an "$r$" decreasing
process. Clearly, the smaller $r$ is, the more inactive predictors we can discard and the more effective the
DPP rules will be.

Remark: It is worthwhile to note that DPP is not the same as ST1 in Xiang et al. [2011] and SAFE in
El Ghaoui et al. [2010a]. From the perspective of the sphere test, the radii of ST1/SAFE and DPP are the
same. But the centers of ST1 and DPP are $y/\lambda$ and $y/\lambda_{\max}$ respectively, which leads to different formulas,
i.e., Eq. (2) and Corollary 2.

2.2 DPP Rules with LARS/Homotopy Algorithms

It is well known that under mild conditions, the set $\{\beta^*(\lambda) : \lambda > 0\}$ (also known as the regularization path
[Mairal and Yu 2012]) is continuous and piecewise linear [Osborne et al. 2000, Efron et al. 2004, Mairal and Yu 2012].
The output of LARS or Homotopy algorithms is in fact a sequence of values like $(\beta^*(\lambda^{(0)}), \lambda^{(0)}), (\beta^*(\lambda^{(1)}), \lambda^{(1)}), \ldots$,
where $\beta^*(\lambda^{(i)})$ corresponds to the $i$th breakpoint of the regularization path $\{\beta^*(\lambda) : \lambda > 0\}$ and the $\lambda^{(i)}$'s are
monotonically decreasing. By Eq. (6), once we get $\beta^*(\lambda^{(i)})$, we can immediately compute $\theta^*(\lambda^{(i)})$. Then
according to Theorem 1, we can construct a DPP rule based on $\theta^*(\lambda^{(i)})$ and $\lambda^{(i)}$. For convenience, if the
DPP rule is built based on $\theta^*(\lambda^{(i)})$, we add the index $i$ as a suffix to DPP, e.g., DPP5 means it is developed
based on $\theta^*(\lambda^{(5)})$.

It should be noted that LARS or Homotopy algorithms are very efficient at finding the first few breakpoints of
the regularization path and the corresponding parameters. For the first few breakpoints, the computational
cost is roughly $O(Np)$, i.e., linear in the size of the data matrix $X$. In Table 1, we report both the time
used for screening and the time needed to solve the Lasso problem after screening. The Lasso solver is from
the SLEP [Liu et al. 2009] package.

From Table 1, we can see that compared with the time saved by the screening rules, the time used for
screening is negligible. With DPP20, the total time drops from 103.314 s to 0.769 s, a speedup of more than 130 times. In
practice, DPP rules built on the first few $\theta^*(\lambda^{(i)})$'s lead to more significant performance improvement than
existing state-of-the-art screening tests. We will demonstrate the effectiveness of our DPP rules in the experiment
section.

As another useful property of LARS/Homotopy algorithms, it is worthwhile to mention that changes
of the active set only happen at the breakpoints [Osborne et al. 2000, Efron et al. 2004, Mairal and
Yu 2012]. Consequently, given the parameters corresponding to a pair of adjacent breakpoints, e.g.,
$\lambda^{(k)}$ and $\lambda^{(k+1)}$, the active set for $\lambda \in (\lambda^{(k+1)}, \lambda^{(k)})$ is the same as for $\lambda = \lambda^{(k)}$. Therefore, besides the sequence of
breakpoints and the associated parameters $(\beta^*(\lambda^{(0)}), \lambda^{(0)}), \ldots, (\beta^*(\lambda^{(k)}), \lambda^{(k)})$ computed by LARS/Homotopy
algorithms, we know the active set for all $\lambda \ge \lambda^{(k)}$. Hence we can remove the predictors in the inactive set
from the optimization problem (1). This scheme has been embedded in DPP rules.



Remark: Some works, e.g., Tibshirani et al. [2012] and El Ghaoui et al. [2010b], solve several Lasso
problems for different parameters to improve the screening performance. However, the DPP algorithms do
not aim to solve a sequence of Lasso problems, but just to accelerate one. The LARS/Homotopy algorithms
are used to find the first few breakpoints of the regularization path and the corresponding parameters, instead
of solving general Lasso problems. Therefore, different from Tibshirani et al. [2012] and El Ghaoui et al.
[2010b], who need to iteratively compute a screening step and a Lasso step, DPP algorithms only compute
one screening step and one Lasso step.



2.3 Sequential Version of DPP Rules 

Motivated by the ideas of Tibshirani et al. [2012] and El Ghaoui et al. [2010b], we can develop a sequential
version of DPP rules. In other words, if we are given a sequence of parameter values $\lambda_1 > \lambda_2 > \ldots > \lambda_m$, we
can first apply DPP to discard inactive predictors for the Lasso problem (1) with the parameter being $\lambda_1$. After
solving the reduced optimization problem for $\lambda_1$, we obtain the exact solution $\beta^*(\lambda_1)$. Hence by Eq. (6), we
can find $\theta^*(\lambda_1)$. According to Theorem 1, once we know the optimal dual solution $\theta^*(\lambda_1)$, we can construct
a new screening rule to identify inactive predictors for problem (1) with $\lambda = \lambda_2$. By repeating the above
process, we obtain the sequential version of the DPP rule (SDPP).

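Corollary 3. SDPP: For the Lasso problem (1), suppose we are given a sequence of parameter values
$\lambda_{\max} = \lambda_0 > \lambda_1 > \ldots > \lambda_m$. Then for any integer $0 \le k < m$, if $\beta^*(\lambda_k)$ is known and the following holds:

$$\left|x_i^T \frac{y - X\beta^*(\lambda_k)}{\lambda_k}\right| < 1 - \|x_i\|_2 \|y\|_2 \left(\frac{1}{\lambda_{k+1}} - \frac{1}{\lambda_k}\right),$$

then $[\beta^*(\lambda_{k+1})]_i = 0$.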
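The sequential procedure of Corollary 3 alternates a screening step and a solve on the reduced problem. The sketch below is one possible arrangement, not the authors' code: `sdpp_path` and the `solve_lasso` callback are hypothetical names, and the usage plugs in a soft-thresholding solver that is exact only because the toy design is the identity.

```python
import numpy as np

def sdpp_path(X, y, lams, solve_lasso):
    """Sequential DPP (Corollary 3) along a strictly decreasing grid of parameters.

    lams[0] must equal lambda_max, where beta* = 0 is known exactly.
    solve_lasso(X_sub, y, lam) is a user-supplied exact Lasso solver (hypothetical).
    """
    p = X.shape[1]
    col_norms = np.linalg.norm(X, axis=0)
    y_norm = np.linalg.norm(y)
    beta = np.zeros(p)                                   # beta*(lambda_max) = 0
    path, kept_counts = [beta], []
    for lam_k, lam_next in zip(lams[:-1], lams[1:]):
        theta_k = (y - X @ beta) / lam_k                 # dual optimum via Eq. (6)
        radius = y_norm * (1.0 / lam_next - 1.0 / lam_k)
        # Keep every predictor that the test of Corollary 3 cannot discard.
        keep = np.abs(X.T @ theta_k) >= 1.0 - col_norms * radius
        beta = np.zeros(p)
        if keep.any():
            beta[keep] = solve_lasso(X[:, keep], y, lam_next)
        path.append(beta)
        kept_counts.append(int(keep.sum()))
    return path, kept_counts

def soft_threshold_solver(X_sub, y, lam):
    """Exact Lasso solution ONLY when X_sub has orthonormal columns."""
    c = X_sub.T @ y
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

X = np.eye(3)                                            # assumption: identity design
y = np.array([3.0, -0.5, 1.2])
lams = [3.0, 2.5, 2.0, 1.5]                              # lams[0] is lambda_max for this data
path, kept_counts = sdpp_path(X, y, lams, soft_threshold_solver)
```

For general data the callback would be replaced by any Lasso solver run to high accuracy, since the guarantee of Corollary 3 presumes that $\beta^*(\lambda_k)$ is exact.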



Remark: There are some other related works on screening rules, e.g., Wu et al. [2009] built
screening rules for $\ell_1$ penalized logistic regression based on the inner products between the response vector
and each predictor; Tibshirani et al. [2012] developed strong rules for a set of Lasso-type
problems via the inner products between the residual and predictors; Fan and Lv [2008]
studied screening rules for Lasso and related problems. But all of the above works may mistakenly discard
predictors that have non-zero coefficients in the solution. Similar to [El Ghaoui et al. 2010a, Xiang et al.
2011, Xiang and Ramadge 2012], our DPP rules are exact in the sense that the predictors discarded by
our rules are inactive predictors, i.e., predictors that have zero coefficients in the solution.



3 Extensions to Group Lasso 



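To demonstrate the flexibility of DPP rules, we extend our idea to the group Lasso problem [Yuan and Lin
2006]:

$$\inf_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\left\|y - \sum_{g=1}^{G} X_g \beta_g\right\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2, \tag{10}$$

where $X_g \in \mathbb{R}^{N \times n_g}$ is the data matrix for the $g$th group and $p = \sum_{g=1}^{G} n_g$.

The corresponding dual problem of (10) is (see detailed derivation in the supplemental materials):

$$\sup_{\theta} \; \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \quad \text{subject to } \|X_g^T\theta\|_2 \le \sqrt{n_g}, \; g = 1, 2, \ldots, G. \tag{11}$$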
Similar to the Lasso problem, the primal and dual optimal solutions of the group Lasso satisfy:

$$y = \sum_{g=1}^{G} X_g \beta_g^* + \lambda\theta^* \tag{12}$$

and the KKT conditions are:

$$X_g^T \theta^* \in \begin{cases} \left\{\sqrt{n_g}\,\dfrac{\beta_g^*}{\|\beta_g^*\|_2}\right\} & \text{if } \beta_g^* \neq 0, \\[2ex] \{\sqrt{n_g}\,u : \|u\|_2 \le 1\} & \text{if } \beta_g^* = 0, \end{cases} \tag{13}$$

for $g = 1, 2, \ldots, G$.

Clearly, if $\|X_g^T \theta^*\|_2 < \sqrt{n_g}$, we can conclude that $\beta_g^* = 0$.

Consider problem (11). It is easy to see that the dual optimal $\theta^*$ is the projection of $y/\lambda$ onto the feasible set.
For each $g$, the constraint $\|X_g^T\theta\|_2 \le \sqrt{n_g}$ confines $\theta$ to an ellipsoid which is closed and convex. Therefore,
the feasible set of the dual problem (11) is the intersection of ellipsoids and thus closed and convex. Hence
$\theta^*(\lambda)$ is also nonexpansive for the group Lasso problem. Similar to Theorem 1, we can readily develop the
following theorem for group Lasso.

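Theorem 4. For the group Lasso problem, assume we are given the solution of its dual problem $\theta^*(\lambda')$ for
a specific $\lambda'$. Let $\lambda''$ be a positive value different from $\lambda'$. If the following holds:

$$\|X_g^T \theta^*(\lambda')\|_2 < \sqrt{n_g} - \|X_g\|_F \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right|, \tag{14}$$

then $\beta_g^*(\lambda'') = 0$.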



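Proof. From the KKT conditions in Eq. (13), we know

$$\|X_g^T \theta^*(\lambda'')\|_2 < \sqrt{n_g} \;\Rightarrow\; \beta_g^*(\lambda'') = 0.$$

By the dual problem (11), $\theta^*(\lambda)$ is the projection of $y/\lambda$ onto the feasible set, which is closed and convex.
Note that the feasible set is in fact the intersection of ellipsoids:

$$\{\theta : \|X_g^T\theta\|_2 \le \sqrt{n_g}\}, \quad g = 1, 2, \ldots, G.$$

According to the projection theorem [Bertsekas 2003] for closed convex sets, $\theta^*(\lambda)$ is continuous and
nonexpansive, i.e.,

$$\|\theta^*(\lambda'') - \theta^*(\lambda')\|_2 \le \left\|\frac{y}{\lambda''} - \frac{y}{\lambda'}\right\|_2 = \|y\|_2 \left|\frac{1}{\lambda''} - \frac{1}{\lambda'}\right|. \tag{15}$$

Then

$$\begin{aligned}
\|X_g^T \theta^*(\lambda'')\|_2 &\le \|X_g^T \theta^*(\lambda'') - X_g^T \theta^*(\lambda')\|_2 + \|X_g^T \theta^*(\lambda')\|_2 \\
&< \|X_g\|_2 \|\theta^*(\lambda'') - \theta^*(\lambda')\|_2 + \sqrt{n_g} - \|X_g\|_F \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right| \\
&\le \|X_g\|_F \|y\|_2 \left|\frac{1}{\lambda''} - \frac{1}{\lambda'}\right| + \sqrt{n_g} - \|X_g\|_F \|y\|_2 \left|\frac{1}{\lambda'} - \frac{1}{\lambda''}\right| = \sqrt{n_g},
\end{aligned} \tag{16}$$

which completes the proof.

We use Eq. (15) and the fact that $\|X_g\|_2 \le \|X_g\|_F$ in the last inequality of Eq. (16). The subscript in $\|\cdot\|_F$ denotes the
Frobenius norm. □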
Similar to the Lasso problem, let

$$\lambda_{\max} = \max_g \frac{\|X_g^T y\|_2}{\sqrt{n_g}};$$

it is easy to see that $y/\lambda_{\max}$ is itself feasible, and $\lambda_{\max}$ is the largest parameter such that problem (10) has a
nonzero solution. Similar to DPP and SDPP, we can construct GDPP and SGDPP for group Lasso.



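Corollary 5. GDPP: For the group Lasso problem (10), let $\lambda_{\max} = \max_g \|X_g^T y\|_2 / \sqrt{n_g}$.

1. If $\lambda \ge \lambda_{\max}$, $\beta_g^*(\lambda) = 0$, $\forall g = 1, 2, \ldots, G$;

2. Otherwise, if the following holds:

$$\left\|X_g^T \frac{y}{\lambda_{\max}}\right\|_2 < \sqrt{n_g} - \|X_g\|_F \|y\|_2 \left(\frac{1}{\lambda} - \frac{1}{\lambda_{\max}}\right),$$

then $\beta_g^*(\lambda) = 0$.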
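The group version of the test can be sketched in the same way as the Lasso version. The code below is an illustrative assumption rather than the authors' implementation: the name `gdpp_screen` and the representation of groups as lists of column indices are hypothetical choices.

```python
import numpy as np

def gdpp_screen(X, y, groups, lam):
    """Corollary 5: mask over groups that are guaranteed to be zero at lam.

    groups is a list of column-index lists, one per group (a hypothetical layout).
    """
    sqrt_n = np.array([np.sqrt(len(g)) for g in groups])
    corr = np.array([np.linalg.norm(X[:, g].T @ y) for g in groups])   # ||X_g^T y||_2
    lam_max = np.max(corr / sqrt_n)
    if lam >= lam_max:
        return np.ones(len(groups), dtype=bool)           # item 1: every group is zero
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    fro = np.array([np.linalg.norm(X[:, g]) for g in groups])
    bound = sqrt_n - fro * np.linalg.norm(y) * (1.0 / lam - 1.0 / lam_max)
    return corr / lam_max < bound

X = np.eye(4)                                   # assumption: identity design
y = np.array([3.0, 4.0, 0.3, 0.4])
groups = [[0, 1], [2, 3]]
mask = gdpp_screen(X, y, groups, lam=3.4)       # lambda_max is 5 / sqrt(2), about 3.54
```

Here the second group is discarded and the first is kept, which agrees with the group soft-thresholding solution for this orthonormal design.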



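Corollary 6. SGDPP: For the group Lasso problem (10), suppose we are given a sequence of parameter
values $\lambda_{\max} = \lambda_0 > \lambda_1 > \ldots > \lambda_m$. For any integer $0 \le k < m$, if $\beta^*(\lambda_k)$ is known and the following holds:

$$\left\|X_g^T \frac{y - \sum_{g'=1}^{G} X_{g'} \beta_{g'}^*(\lambda_k)}{\lambda_k}\right\|_2 < \sqrt{n_g} - \|X_g\|_F \|y\|_2 \left(\frac{1}{\lambda_{k+1}} - \frac{1}{\lambda_k}\right),$$

then $\beta_g^*(\lambda_{k+1}) = 0$.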



4 Experiments 



We evaluated our screening rules on both synthetic and real data sets. To measure the performance of our
screening rules, we compute the rejection rate, i.e., the ratio between the number of predictors discarded by
screening rules and the actual number of zero predictors in the ground truth. Because the DPP rules are
exact, i.e., no active predictors will be mistakenly discarded, the rejection rate cannot exceed one.
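The rejection rate defined above is a simple ratio of counts. A minimal sketch follows; the function name `rejection_rate` and the tolerance argument used to decide which ground-truth coefficients count as zero are assumptions.

```python
import numpy as np

def rejection_rate(discarded, beta_true, tol=0.0):
    """Number of discarded predictors divided by the number of zero coefficients.

    discarded: boolean mask produced by a screening rule.
    beta_true: ground-truth Lasso solution from a trusted solver.
    tol: magnitude at or below which a coefficient counts as zero (an assumption).
    """
    discarded = np.asarray(discarded, dtype=bool)
    zero = np.abs(np.asarray(beta_true, dtype=float)) <= tol
    if zero.sum() == 0:
        return float("nan")            # no inactive predictors, ratio undefined
    return float(discarded.sum() / zero.sum())

# Toy usage: three true zeros, two of them found by the rule.
rate = rejection_rate([True, False, True, False], [0.0, 0.0, 0.0, 1.5])
```

For an exact rule the discarded predictors form a subset of the true zeros, so the returned value lies between 0 and 1.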

We compare the performance of DPP with Dome [Xiang and Ramadge 2012, Xiang et al. 2011], which
achieves state-of-the-art performance for the Lasso problem among exact screening rules [Xiang and Ramadge
2012]. We evaluate GDPP and SGDPP for the group Lasso problem on three synthetic data sets in Section
4.2. We are not aware of any "exact" screening rules for the group Lasso problem at this point. For SAFE
and Dome, it is not straightforward to extend them to the group Lasso problem.

Similarly to previous works [Xiang et al. 2011], we do not report the computational time saved by
screening because it can be easily computed from the rejection rate. Specifically, if the Lasso solver is
linear in terms of the size of the data matrix $X$, a K% rejection of the data can save K% computational
time.



4.1 DPPs for the Lasso Problem 



We compare the performance of DPP rules and Dome on: (a) three synthetic data sets with different
dimensions; (b) the MNIST handwritten digit data set [LeCun et al. 1998]; (c) the COIL rotational image data set
[Nene et al. 1996]; and (d) the Olivetti Faces data set [Samaria and Harter 1994]. There are many solvers
[Becker et al. 2010, Friedman et al. 2007, Kim et al. 2007, Osborne et al. 2000] which can be used to find
the ground truth, i.e., the solution of problem (1).
4.1.1 Synthetic Data Sets

We generate three synthetic data sets with different dimensions. For each of the cases, the entries of the data
matrix $X$ and the response vector $y$ are i.i.d. standard Gaussian. Each data
matrix contains 100 samples with $p$ = 50, 500, and 5000 respectively. For each case, once we generate the
data matrix $X$, we compare the performance of DPP rules with Dome along a sequence of 100 parameter
values equally spaced on the $\lambda/\lambda_{\max}$ scale. Then we repeat the procedure 500 times and report the average
performance of each rule.
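The data generation and parameter grid described above can be set up as follows. This is a sketch under explicit assumptions: the text does not state the endpoints of the grid, the random seed, or any normalization of $X$ and $y$, so the `lo`, `hi`, and `seed` arguments are placeholders and no normalization is applied.

```python
import numpy as np

def make_synthetic(n_samples, n_features, seed=0):
    """i.i.d. standard Gaussian design matrix and response (seed is an assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    y = rng.standard_normal(n_samples)
    return X, y

def lambda_grid(X, y, num=100, lo=0.05, hi=1.0):
    """num values equally spaced on the lambda / lambda_max scale.

    The endpoints lo and hi are placeholders; the text does not specify them.
    """
    lam_max = np.max(np.abs(X.T @ y))
    return lam_max * np.linspace(lo, hi, num)

X, y = make_synthetic(100, 50)
grid = lambda_grid(X, y)       # 100 increasing values ending at lambda_max
```

Each screening rule would then be evaluated at every value in `grid`, and the whole procedure repeated over independent draws of the data.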




(a) $X \in \mathbb{R}^{100 \times 50}$ (b) $X \in \mathbb{R}^{100 \times 500}$ (c) $X \in \mathbb{R}^{100 \times 5000}$

Figure 1: Comparison of DPP rules and Dome on three synthetic data sets. Each column corresponds to
one of the three synthetic data sets with different dimensions.

The three subfigures of Fig. 1 correspond to the three different design matrices $X$, and the average $\lambda_{\max}$
is 0.249, 0.315 and 0.371 respectively. As shown in Fig. 1, the performance of DPP is comparable to Dome
but all the other DPP rules significantly outperform Dome. In contrast to Dome, which performs better
with larger $\lambda_{\max}$ [Xiang et al. 2011], DPP rules exhibit stronger capability in discarding inactive predictors
when $\lambda_{\max}$ is small. The geometric intuition behind this observation is that the sparser
the predictors distribute over the unit ball, the longer the line segment of the regularization path is. If the
length of the line segment of the regularization path is larger, the first few breakpoints may correspond to
very small $\lambda$ values.



4.1.2 MNIST Digit Data Set 



This data set contains grey images of scanned handwritten digits, including 60,000 for training and 10,000
for testing. The dimension of each image is 28 x 28. We first randomly select 100 images for each digit (and
in total we have 1000 images) and get a data matrix $X \in \mathbb{R}^{784 \times 1000}$.

Similarly, we compare the performance of DPP rules with Dome along a sequence of 100 parameter values
equally spaced on the $\lambda/\lambda_{\max}$ scale. We repeat the procedure 500 times and report the average performance
of each rule. In contrast to the case of synthetic data, the average $\lambda_{\max}$ is large ($\lambda_{\max} = 0.837$) for the
MNIST data set. As noted in [Xiang et al. 2011, Xiang and Ramadge 2012], Dome is strong when $\lambda_{\max}$
is large. Fig. 2(a) shows Dome outperforms DPP and DPP2. But still, all the other DPP rules perform
significantly better than Dome.



4.1.3 COIL Rotational Object Image Data Set 

In this experiment, we consider the case where $N \gg p$ and the predictors are highly correlated. The COIL
data set includes 7200 images for 100 objects. We use object No. 13 with 72 color images of size 128 x 128
taken every 5 degrees by rotating the object. Each time, we take one of the images as the response vector $y$
and use all the remaining images to construct the data matrix. Then we compare the performance of DPP
rules and Dome along a sequence of 50 parameter values equally spaced on the $\lambda/\lambda_{\max}$ scale. By using every
image as the response vector, we repeat the procedure 72 times. We transform each color image to a column
vector with 3 x 128 x 128 = 49152 elements. Therefore we obtain a data matrix $X \in \mathbb{R}^{49152 \times 71}$. The average
$\lambda_{\max}$ is 0.988.

(a) MNIST (b) COIL (c) Olivetti

Figure 2: Comparison of DPP rules and Dome on three real data sets. MNIST digit data set (left), COIL
image data set (middle) and Olivetti face data set (right).

As shown in Fig. 2(b), Dome discards many more inactive predictors than DPP and DPP2 even for
small $\lambda$. But DPP5 significantly outperforms Dome. This is because the average parameter value of the 5th
breakpoint $\bar{\lambda}^{(5)}$ is very small, and we know for any single run DPP5 can discard all predictors in the inactive
set for $\lambda > \lambda^{(5)}$. For the same reason, DPP10 and DPP20 can identify almost all of the inactive predictors
even for very small $\lambda$.



4.1.4 Olivetti Faces Data Set 

This data set includes 400 grey scale face images of size 64 x 64 for 40 people (10 for each). We sequentially
take one of the images as the response vector and use the remaining images to construct the data matrix $X$. All images are
converted to column vectors and thus $y \in \mathbb{R}^{4096}$, $X \in \mathbb{R}^{4096 \times 399}$. We compare the performance of DPP rules
and Dome along a sequence of 50 parameter values equally spaced on the $\lambda/\lambda_{\max}$ scale. The average $\lambda_{\max}$
is 0.989.

As shown in Fig. 2(c), Dome outperforms DPP and DPP2. DPP5 discards more inactive predictors than
Dome, especially for small $\lambda$. As expected, DPP10 and DPP20 further improve on DPP5.



4.2 GDPPs for the Group Lasso Problem 




(a) 20 groups (b) 50 groups (c) 100 groups

Figure 3: Performance of GDPP and SGDPP applied to three synthetic data sets with different numbers of
groups.

We apply GDPPs to three synthetic data sets. The entries of the data matrix $X \in \mathbb{R}^{100 \times 1000}$ and the response
vector $y$ are generated i.i.d. from the standard Gaussian distribution. For each of the cases, we randomly
divide $X$ into 20, 50, and 100 groups. We compare the performance of GDPP and SGDPP along a sequence
of 100 parameter values equally spaced on the $\lambda/\lambda_{\max}$ scale. We repeat the above procedure 100 times for
each of the cases and report the average performance. The average $\lambda_{\max}$ values are 0.136, 0.167, and 0.219
respectively.

As shown in Fig. 3, as expected, SGDPP significantly outperforms GDPP, which only makes use
of the information of the dual optimal solution at a single point.

Remark: For the group Lasso problem, the feasible set of its dual variables is the intersection of ellipsoids
and is thus no longer a polytope. As a consequence, the path of the optimal solution is no longer piecewise
linear. Due to this fact, it is more complicated to characterize the path and find the breakpoints where
groups of predictors enter or leave the active set. However, if there are efficient algorithms which can find
the breakpoints and the corresponding parameters, like LARS for Lasso, we can potentially make use of those
breakpoints and the associated parameters to construct more effective screening rules based on Theorem 4.

5 Conclusion 

In this paper, we develop new screening rules for the Lasso problem by making use of the nonexpansiveness 
of the projection operator with respect to a closed convex set. Our new methods, i.e., DPP screening rules, 
are able to effectively identify inactive predictors of the Lasso problem, thus greatly reducing the size of the 
optimization problem. The idea of DPP rules can be easily generalized to screen the inactive groups of the 
group Lasso problem. Extensive numerical experiments on both synthetic and real data demonstrate the 
effectiveness of the proposed rules. It is worthwhile to mention that DPP rules can be combined with any 
Lasso solver as a speedup tool. 

In the future, we plan to generalize our idea to other sparse formulations consisting of different loss 
functions, e.g., logistic/hinge loss, and more general structured sparse penalty, e.g., group/graph Lasso. 
Appendix

In this appendix, we show the detailed procedure to derive the dual formulation of standard Lasso and
group Lasso in Sections A and B.



A Deviation of the Dual Problem of Standard Lasso 



A.1 Dual Formulation 

Assuming the data matrix is $X \in \mathbb{R}^{N \times p}$, the standard Lasso problem is given by: 

$$\inf_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \tag{17}$$



For completeness, we give a detailed derivation of the dual formulation of (17) in this section. Note that 
problem (17) has no constraints. Therefore the dual problem is trivial and useless. A common trick 
(Boyd and Vandenberghe, 2004) is to introduce a new set of variables $z = y - X\beta$ such that problem (17) becomes: 

$$\inf_{\beta, z} \; \frac{1}{2}\|z\|_2^2 + \lambda\|\beta\|_1 \quad \text{subject to } z = y - X\beta \tag{18}$$

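By introducing the dual variables $\eta \in \mathbb{R}^N$, we get the Lagrangian of problem (18): 

$$L(\beta, z, \eta) = \frac{1}{2}\|z\|_2^2 + \lambda\|\beta\|_1 + \eta^T (y - X\beta - z) \tag{19}$$

For the Lagrangian, the primal variables are $\beta$ and $z$. The dual function $g(\eta)$ is: 

$$g(\eta) = \inf_{\beta, z} L(\beta, z, \eta) = \eta^T y + \inf_{\beta}\left(-\eta^T X\beta + \lambda\|\beta\|_1\right) + \inf_{z}\left(\frac{1}{2}\|z\|_2^2 - \eta^T z\right) \tag{20}$$

In order to get $g(\eta)$, we need to solve the following two optimization problems: 

$$\inf_{\beta} \; -\eta^T X\beta + \lambda\|\beta\|_1 \tag{21}$$

and 

$$\inf_{z} \; \frac{1}{2}\|z\|_2^2 - \eta^T z \tag{22}$$

Let us first consider problem (21). Denote the objective function of problem (21) as 

$$f_1(\beta) = -\eta^T X\beta + \lambda\|\beta\|_1. \tag{23}$$

$f_1(\beta)$ is convex but not smooth. Therefore let us consider its subgradient 

$$\partial f_1(\beta) = -X^T\eta + \lambda v$$

in which $\|v\|_\infty \le 1$ and $v^T\beta = \|\beta\|_1$, i.e., $v$ is a subgradient of $\|\beta\|_1$. 
The necessary condition for $f_1$ to attain an optimum is 

$$\exists\, \beta' \text{ such that } 0 \in \partial f_1(\beta') = \{-X^T\eta + \lambda v'\}$$

where $v' \in \partial\|\beta'\|_1$. In other words, $\beta', v'$ should satisfy 

$$v' = \frac{X^T\eta}{\lambda}, \quad \|v'\|_\infty \le 1, \quad v'^T\beta' = \|\beta'\|_1$$

which is equivalent to 

$$|x_i^T\eta| \le \lambda, \quad i = 1, 2, \ldots, p. \tag{24}$$

Then we plug $v' = \frac{X^T\eta}{\lambda}$ and $v'^T\beta' = \|\beta'\|_1$ into Eq. (23): 

$$f_1(\beta') = \inf_{\beta} f_1(\beta) = -\eta^T X\beta' + \lambda\left(\frac{X^T\eta}{\lambda}\right)^T \beta' = 0 \tag{25}$$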



Therefore, the optimum value of problem (21) is 0. 
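
The role of constraint (24) in problem (21) can be checked numerically. The sketch below, with made-up values for $X^T\eta$ and hypothetical variable names, shows that $f_1$ stays nonnegative when every $|x_i^T\eta| \le \lambda$, and that it decreases without bound along a coordinate direction as soon as one constraint is violated.

```python
import random

def f1(beta, Xt_eta, lam):
    """f_1(beta) = -eta^T X beta + lam * ||beta||_1, with Xt_eta = X^T eta."""
    linear = sum(c * b for c, b in zip(Xt_eta, beta))
    return -linear + lam * sum(abs(b) for b in beta)

lam = 1.0
random.seed(0)

# Feasible eta: every |x_i^T eta| <= lam, so f_1 >= 0 and the infimum 0
# is attained at beta = 0.
feasible = [0.9, -0.4, 1.0]
min_feasible = min(
    f1([random.uniform(-5, 5) for _ in feasible], feasible, lam)
    for _ in range(1000)
)

# Infeasible eta: |x_1^T eta| = 1.5 > lam, so f_1 decreases without bound
# along beta = t * e_1 as t grows.
infeasible = [1.5, -0.4, 1.0]
along_ray = [f1([t, 0.0, 0.0], infeasible, lam) for t in (1.0, 10.0, 100.0)]
```

In the infeasible case the infimum is $-\infty$, which is why (24) appears as a constraint of the dual problem.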



Next, let us consider problem (22). Denote the objective function of problem (22) as $f_2(z)$. Let us rewrite 
$f_2(z)$ as: 

$$f_2(z) = \frac{1}{2}\left(\|z - \eta\|_2^2 - \|\eta\|_2^2\right) \tag{26}$$

Clearly, 

$$z' = \arg\min_{z} f_2(z) = \eta$$

and 

$$\inf_{z} f_2(z) = -\frac{1}{2}\|\eta\|_2^2.$$
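
A quick numerical check of the completed square in Eq. (26), using an arbitrary illustrative $\eta$: the value of $f_2$ at $z = \eta$ matches $-\frac{1}{2}\|\eta\|_2^2$, and no randomly drawn $z$ does better.

```python
import random

def f2(z, eta):
    """f_2(z) = 0.5 * ||z||^2 - eta^T z."""
    return 0.5 * sum(a * a for a in z) - sum(e * a for e, a in zip(eta, z))

eta = [1.0, -2.0, 0.5]
value_at_eta = f2(eta, eta)                    # claimed minimum, attained at z = eta
closed_form = -0.5 * sum(e * e for e in eta)   # -0.5 * ||eta||^2

random.seed(1)
trials = [[e + random.uniform(-3, 3) for e in eta] for _ in range(1000)]
smallest_trial = min(f2(z, eta) for z in trials)
```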



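Combining everything above, we get the dual problem: 

$$\sup_{\eta} \; g(\eta) = \eta^T y - \frac{1}{2}\|\eta\|_2^2 \quad \text{subject to } |x_i^T\eta| \le \lambda, \; i = 1, 2, \ldots, p \tag{27}$$

which is equivalent to 

$$\sup_{\eta} \; g(\eta) = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|\eta - y\|_2^2 \quad \text{subject to } |x_i^T\eta| \le \lambda, \; i = 1, 2, \ldots, p \tag{28}$$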



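By a simple re-scaling of the dual variables $\eta$, i.e., let $\theta = \frac{\eta}{\lambda}$, problem (28) transforms to: 

$$\sup_{\theta} \; g(\theta) = \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \quad \text{subject to } |x_i^T\theta| \le 1, \; i = 1, 2, \ldots, p \tag{29}$$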
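
The three forms of the dual objective in (27), (28) and (29) differ only by completing the square and by the substitution $\theta = \eta/\lambda$. The following sketch evaluates all three at one arbitrary point; the function names and numbers are illustrative assumptions. Form (29) also makes explicit that the dual optimum is the point of the feasible polytope closest to $y/\lambda$, i.e., the projection of $y/\lambda$ onto it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def g27(eta, y):
    """Objective of (27): eta^T y - 0.5 * ||eta||^2."""
    return dot(eta, y) - 0.5 * dot(eta, eta)

def g28(eta, y):
    """Objective of (28): 0.5 * ||y||^2 - 0.5 * ||eta - y||^2."""
    diff = [e - b for e, b in zip(eta, y)]
    return 0.5 * dot(y, y) - 0.5 * dot(diff, diff)

def g29(theta, y, lam):
    """Objective of (29): 0.5 * ||y||^2 - 0.5 * lam^2 * ||theta - y/lam||^2."""
    diff = [t - b / lam for t, b in zip(theta, y)]
    return 0.5 * dot(y, y) - 0.5 * lam ** 2 * dot(diff, diff)

y = [3.0, 1.0, -2.0]
lam = 2.0
eta = [1.0, -0.5, 0.25]
theta = [e / lam for e in eta]   # re-scaled dual variable
values = (g27(eta, y), g28(eta, y), g29(theta, y, lam))
```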



A.2 Relationship Between The Primal And Dual Variables 

Problem (18) is clearly convex and its constraints are all affine. By Slater's condition, as long as problem 
(18) is feasible we will have strong duality. Denote $\beta^*$, $z^*$ and $\theta^*$ as optimal primal and dual variables. The 
Lagrangian is 

$$L(\beta, z, \theta) = \frac{1}{2}\|z\|_2^2 + \lambda\|\beta\|_1 + \lambda\theta^T (y - X\beta - z) \tag{30}$$



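From the KKT condition, we have 

$$0 \in \partial_{\beta} L(\beta^*, z^*, \theta^*) = -\lambda X^T\theta^* + \lambda v, \text{ in which } \|v\|_\infty \le 1 \text{ and } v^T\beta^* = \|\beta^*\|_1 \tag{31}$$

$$\nabla_{z} L(\beta^*, z^*, \theta^*) = z^* - \lambda\theta^* = 0 \tag{32}$$

$$\nabla_{\theta} L(\beta^*, z^*, \theta^*) = \lambda(y - X\beta^* - z^*) = 0 \tag{33}$$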
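From Eq. (32) and (33), we have: 

$$y = X\beta^* + \lambda\theta^* \tag{34}$$

From Eq. (31), we know there exists $v^* \in \partial\|\beta^*\|_1$ such that 

$$X^T\theta^* = v^*, \quad \|v^*\|_\infty \le 1 \quad \text{and} \quad (v^*)^T\beta^* = \|\beta^*\|_1$$

which is equivalent to 

$$|x_i^T\theta^*| \le 1, \; i = 1, 2, \ldots, p, \quad \text{and} \quad (\theta^*)^T X\beta^* = \|\beta^*\|_1 \tag{35}$$

From Eq. (35), it is easy to conclude: 

$$(\theta^*)^T x_i \in \begin{cases} \operatorname{sign}(\beta_i^*) & \text{if } \beta_i^* \ne 0 \\ [-1, 1] & \text{if } \beta_i^* = 0 \end{cases} \tag{36}$$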
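
These relations can be verified on a case where the Lasso solution is known in closed form. The sketch below assumes $X = I$ (an illustrative special case, not a setting used in the paper), for which $\beta^*$ is the soft-thresholding of $y$. It recovers $\theta^*$ from $y = X\beta^* + \lambda\theta^*$ and compares the primal objective (17) with the dual objective (29).

```python
def soft_threshold(v, lam):
    """Closed-form Lasso solution when X is the identity (illustrative special case)."""
    return [max(abs(a) - lam, 0.0) * (1 if a > 0 else -1) for a in v]

y = [3.0, -0.5, 1.0, -2.5]
lam = 1.0
beta = soft_threshold(y, lam)                      # primal optimum for X = I
theta = [(a - b) / lam for a, b in zip(y, beta)]   # from y = X beta* + lam * theta*

# Primal objective (17) and dual objective (29) at the optimal pair.
primal = 0.5 * sum((a - b) ** 2 for a, b in zip(y, beta)) + lam * sum(abs(b) for b in beta)
dual = 0.5 * sum(a * a for a in y) - 0.5 * lam ** 2 * sum(
    (t - a / lam) ** 2 for t, a in zip(theta, y)
)
```

Here every entry of `theta` lies in $[-1, 1]$, it equals the sign of $\beta_i^*$ wherever $\beta_i^* \ne 0$, and the two objective values coincide, as strong duality requires.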



B Derivation of the Dual Problem of Group Lasso 
B.1 Dual Formulation 

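Assuming the data matrix of the $g$-th group is $X_g \in \mathbb{R}^{N \times n_g}$ and $p = \sum_{g=1}^{G} n_g$, the group Lasso problem is given by 

$$\inf_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\left\|y - \sum_{g=1}^{G} X_g\beta_g\right\|_2^2 + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 \tag{37}$$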



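Let $z = y - \sum_{g=1}^{G} X_g\beta_g$ and problem (37) becomes: 

$$\inf_{\beta, z} \; \frac{1}{2}\|z\|_2^2 + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 \quad \text{subject to } z = y - \sum_{g=1}^{G} X_g\beta_g \tag{38}$$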



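By introducing the dual variables $\eta \in \mathbb{R}^N$, the Lagrangian of problem (38) is: 

$$L(\beta, z, \eta) = \frac{1}{2}\|z\|_2^2 + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 + \eta^T \left(y - \sum_{g=1}^{G} X_g\beta_g - z\right) \tag{39}$$

and the dual function $g(\eta)$ is: 

$$g(\eta) = \inf_{\beta, z} L(\beta, z, \eta) = \eta^T y + \inf_{\beta}\left(-\eta^T \sum_{g=1}^{G} X_g\beta_g + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2\right) + \inf_{z}\left(\frac{1}{2}\|z\|_2^2 - \eta^T z\right) \tag{40}$$

In order to get $g(\eta)$, let us solve the following two optimization problems: 

$$\inf_{\beta} \; -\eta^T \sum_{g=1}^{G} X_g\beta_g + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 \tag{41}$$

and 

$$\inf_{z} \; \frac{1}{2}\|z\|_2^2 - \eta^T z \tag{42}$$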
Let us first consider problem (41). Denote the objective function of problem (41) as 

$$f(\beta) = -\eta^T \sum_{g=1}^{G} X_g\beta_g + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 \tag{43}$$

Let 

$$f_g(\beta_g) = -\eta^T X_g\beta_g + \lambda\sqrt{n_g}\,\|\beta_g\|_2, \quad g = 1, 2, \ldots, G,$$

then we can split problem (41) into a set of subproblems. Clearly $f_g(\beta_g)$ is convex but not smooth because 
it has a singular point at 0. Consider the subgradient of $f_g$, 

$$\partial f_g(\beta_g) = -X_g^T\eta + \lambda\sqrt{n_g}\, v_g, \quad g = 1, 2, \ldots, G$$

where $v_g$ is a subgradient of $\|\beta_g\|_2$: 

$$v_g \in \begin{cases} \dfrac{\beta_g}{\|\beta_g\|_2} & \text{if } \beta_g \ne 0 \\ u, \; \|u\|_2 \le 1 & \text{if } \beta_g = 0 \end{cases} \tag{44}$$



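Let $\beta_g'$ be the optimal solution of $f_g$, then $\beta_g'$ satisfies 

$$\exists\, v_g' \in \partial\|\beta_g'\|_2, \quad -X_g^T\eta + \lambda\sqrt{n_g}\, v_g' = 0.$$

If $\beta_g' = 0$, clearly, $f_g(\beta_g') = 0$. Otherwise, since $\lambda\sqrt{n_g}\, v_g' = X_g^T\eta$ and $v_g' = \frac{\beta_g'}{\|\beta_g'\|_2}$, we have 

$$f_g(\beta_g') = -\lambda\sqrt{n_g}\,\frac{(\beta_g')^T\beta_g'}{\|\beta_g'\|_2} + \lambda\sqrt{n_g}\,\|\beta_g'\|_2 = 0.$$

All together, we can conclude that 

$$\inf_{\beta_g} f_g(\beta_g) = 0, \quad g = 1, 2, \ldots, G$$

and thus 

$$\inf_{\beta} f(\beta) = \inf_{\beta} \sum_{g=1}^{G} f_g(\beta_g) = \sum_{g=1}^{G} \inf_{\beta_g} f_g(\beta_g) = 0.$$

The second equality is due to the fact that the $\beta_g$'s are independent. 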

Note, from Eq. (44), it is easy to see $\|v_g\|_2 \le 1$. Since $\lambda\sqrt{n_g}\, v_g' = X_g^T\eta$, we get a constraint on $\eta$, i.e., $\eta$ 
should satisfy: 

$$\|X_g^T\eta\|_2 \le \lambda\sqrt{n_g}, \quad g = 1, 2, \ldots, G.$$
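
The group constraint plays the same role as in the standard Lasso case. The sketch below uses made-up values of $X_g^T\eta$ for a single group with $n_g = 2$ (all names and numbers are illustrative): inside the constraint, $f_g$ is nonnegative by the Cauchy-Schwarz inequality; outside it, $f_g$ decreases without bound along the direction $\beta_g = t\, X_g^T\eta$.

```python
import math
import random

def f_g(beta_g, Xg_t_eta, lam):
    """f_g(beta_g) = -eta^T X_g beta_g + lam * sqrt(n_g) * ||beta_g||_2."""
    n_g = len(beta_g)
    linear = sum(c * b for c, b in zip(Xg_t_eta, beta_g))
    return -linear + lam * math.sqrt(n_g) * math.sqrt(sum(b * b for b in beta_g))

lam = 1.0
random.seed(2)

# ||X_g^T eta||_2 = 1.0 <= lam * sqrt(2): f_g >= 0 by Cauchy-Schwarz.
inside = [0.6, 0.8]
min_inside = min(
    f_g([random.uniform(-5, 5), random.uniform(-5, 5)], inside, lam)
    for _ in range(1000)
)

# ||X_g^T eta||_2 = 2.0 > lam * sqrt(2): f_g decreases without bound
# along beta_g = t * X_g^T eta as t grows.
outside = [1.2, 1.6]
along_ray = [f_g([t * c for c in outside], outside, lam) for t in (1.0, 10.0, 100.0)]
```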



Next, let us consider problem (42). Since problem (42) is exactly the same as problem (22), we conclude: 

$$z' = \arg\min_{z} \; \frac{1}{2}\|z\|_2^2 - \eta^T z = \eta$$

and 

$$\inf_{z} \; \frac{1}{2}\|z\|_2^2 - \eta^T z = -\frac{1}{2}\|\eta\|_2^2.$$

Therefore the dual function $g(\eta)$ is: 

$$g(\eta) = \eta^T y - \frac{1}{2}\|\eta\|_2^2.$$
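Combining everything above, we get the dual formulation of the group Lasso: 

$$\sup_{\eta} \; g(\eta) = \eta^T y - \frac{1}{2}\|\eta\|_2^2 \quad \text{subject to } \|X_g^T\eta\|_2 \le \lambda\sqrt{n_g}, \; g = 1, 2, \ldots, G \tag{45}$$

which is equivalent to 

$$\sup_{\eta} \; g(\eta) = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|\eta - y\|_2^2 \quad \text{subject to } \|X_g^T\eta\|_2 \le \lambda\sqrt{n_g}, \; g = 1, 2, \ldots, G \tag{46}$$

By a simple re-scaling of the dual variables $\eta$, i.e., let $\theta = \frac{\eta}{\lambda}$, problem (46) transforms to: 

$$\sup_{\theta} \; g(\theta) = \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \quad \text{subject to } \|X_g^T\theta\|_2 \le \sqrt{n_g}, \; g = 1, 2, \ldots, G \tag{47}$$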
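
One immediate consequence of (47) is worth illustrating: the unconstrained maximizer of the objective is $\theta = y/\lambda$, so whenever $y/\lambda$ satisfies all the group constraints it is the dual optimum. The sketch below computes the smallest such $\lambda$, denoted here `lam_max` (a name assumed for this example, as are the function names and the per-group values of $X_g^T y$).

```python
import math

def group_dual_feasible(Xg_t_theta_by_group, tol=1e-12):
    """Check ||X_g^T theta||_2 <= sqrt(n_g) for every group, as in (47)."""
    return all(
        math.sqrt(sum(c * c for c in block)) <= math.sqrt(len(block)) + tol
        for block in Xg_t_theta_by_group
    )

def group_lambda_max(Xg_t_y_by_group):
    """Smallest lam for which theta = y / lam is dual feasible."""
    return max(
        math.sqrt(sum(c * c for c in block)) / math.sqrt(len(block))
        for block in Xg_t_y_by_group
    )

def scaled(blocks, lam):
    """Blocks of X_g^T (y / lam), given the blocks of X_g^T y."""
    return [[c / lam for c in block] for block in blocks]

# Hypothetical per-group products X_g^T y for two groups of sizes 2 and 3.
Xt_y = [[3.0, 4.0], [1.0, 2.0, 2.0]]
lam_max = group_lambda_max(Xt_y)
feasible_above = group_dual_feasible(scaled(Xt_y, 4.0))   # lam = 4.0 > lam_max
feasible_below = group_dual_feasible(scaled(Xt_y, 3.0))   # lam = 3.0 < lam_max
```

For $\lambda$ at or above this value, $\theta^* = y/\lambda$ and, by the primal-dual relationship derived in Section B.2, every group is inactive.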



B.2 Relationship Between The Primal And Dual Variables 

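Clearly, problem (38) is convex and its constraints are all affine. By Slater's condition, as long as problem 
(38) is feasible we will have strong duality. Denote $\beta^*$, $z^*$ and $\theta^*$ as optimal primal and dual variables. The 
Lagrangian is 

$$L(\beta, z, \theta) = \frac{1}{2}\|z\|_2^2 + \lambda\sum_{g=1}^{G} \sqrt{n_g}\,\|\beta_g\|_2 + \lambda\theta^T \left(y - \sum_{g=1}^{G} X_g\beta_g - z\right) \tag{48}$$

From the KKT condition, we have 

$$0 \in \partial_{\beta_g} L(\beta^*, z^*, \theta^*) = -\lambda X_g^T\theta^* + \lambda\sqrt{n_g}\, v_g, \text{ in which } v_g \in \partial\|\beta_g^*\|_2, \; g = 1, 2, \ldots, G \tag{49}$$

$$\nabla_{z} L(\beta^*, z^*, \theta^*) = z^* - \lambda\theta^* = 0 \tag{50}$$

$$\nabla_{\theta} L(\beta^*, z^*, \theta^*) = \lambda\left(y - \sum_{g=1}^{G} X_g\beta_g^* - z^*\right) = 0 \tag{51}$$

From Eq. (50) and (51), we have: 

$$y = \sum_{g=1}^{G} X_g\beta_g^* + \lambda\theta^* \tag{52}$$

From Eq. (49), we know there exists $v_g' \in \partial\|\beta_g^*\|_2$ such that 

$$X_g^T\theta^* = \sqrt{n_g}\, v_g'$$

and 

$$v_g' \in \begin{cases} \dfrac{\beta_g^*}{\|\beta_g^*\|_2} & \text{if } \beta_g^* \ne 0 \\ u, \; \|u\|_2 \le 1 & \text{if } \beta_g^* = 0 \end{cases}$$

Then the following holds: 

$$X_g^T\theta^* \in \begin{cases} \sqrt{n_g}\,\dfrac{\beta_g^*}{\|\beta_g^*\|_2} & \text{if } \beta_g^* \ne 0 \\ \sqrt{n_g}\, u, \; \|u\|_2 \le 1 & \text{if } \beta_g^* = 0 \end{cases} \tag{53}$$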



for $g = 1, 2, \ldots, G$. Clearly, if $\|X_g^T\theta^*\|_2 < \sqrt{n_g}$, we can conclude $\beta_g^* = 0$. 
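
As a sanity check of this conclusion, the sketch below assumes $X = I$ partitioned into groups (an illustrative special case, not a setting used in the paper), for which the group Lasso solution is the block soft-thresholding of $y$. With $X = I$, the relation $y = \sum_g X_g\beta_g^* + \lambda\theta^*$ gives $X_g^T\theta^*$ as the $g$-th block of $(y - \beta^*)/\lambda$. All names and data are hypothetical.

```python
import math

def group_soft_threshold(y_groups, lam):
    """Closed-form group Lasso solution when X is the identity split into groups."""
    out = []
    for y_g in y_groups:
        norm = math.sqrt(sum(a * a for a in y_g))
        thresh = lam * math.sqrt(len(y_g))
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        out.append([scale * a for a in y_g])
    return out

y_groups = [[3.0, 4.0], [0.5, -0.5], [0.0, 2.0]]
lam = 1.0
beta = group_soft_threshold(y_groups, lam)
# theta* = (y - beta*) / lam, kept in the same block structure as y.
theta = [[(a - b) / lam for a, b in zip(y_g, b_g)] for y_g, b_g in zip(y_groups, beta)]
theta_norms = [math.sqrt(sum(t * t for t in t_g)) for t_g in theta]
```

In this example the first and third groups are active and their dual norms sit on the boundary $\sqrt{n_g}$, while the second group has dual norm strictly below $\sqrt{n_g}$ and a zero coefficient block.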